NB: The worksheet has been developed and prepared by Maxim Romanov for the course “R for Historical Research” (U Vienna, Spring 2019).
We will use the following text files in this worksheet. Please download them and keep them close to your worksheet. Since some of the files are quite large, you will want to download them before loading them into R:
In order to make loading these files a little easier, you can store the path to the folder where you placed them in a separate variable and then reuse it as follows:
pathToFiles = "./data/"
d1861 <- read.delim(paste0(pathToFiles, "dispatch_1861.tsv"), encoding="UTF-8", header=TRUE, quote="")
#d1862 <- read.delim(paste0(pathToFiles, "dispatch_1862.tsv"), encoding="UTF-8", header=TRUE, quote="")
sw1 <- scan(paste0(pathToFiles, "sw1.md"), what="character", sep="\n")

The first two files contain articles from “The Daily Dispatch” for the years 1861 and 1862. The newspaper was published in Richmond, VA — the capital of the Confederate States (the South) during the American Civil War (1861-1865). The last file is the script of the first episode of Star Wars :). In fact, for now, we only need one file of the Dispatch.
The following are the libraries that we will need for this section. Install those that you do not have yet.
#install.packages(c("tidyverse", "readr", "stringr",
#                   "tidytext", "wordcloud", "RColorBrewer", "quanteda", "readtext"))
# General ones
library(tidyverse)
library(readr)
library("RColorBrewer")
# text analysis specific
library(stringr)
library(tidytext)
library(wordcloud)
library(quanteda)
library(readtext)

R (a refresher)

Functions are groups of related statements that perform a specific task; they help break a program into smaller, modular chunks. As programs grow larger and larger, functions make them more organized and manageable. Functions also help avoid repetition and make code reusable.
Most programming languages, R included, come with a lot of pre-defined (or built-in) functions. Essentially, all statements that take arguments in parentheses are functions. For instance, in the code chunk above, read.delim() is a function that takes the following arguments: 1) a filename (or, a path to a file); 2) the encoding; 3) whether the file has a header; and 4) which character to treat as a quote (here, none, so that " is not treated as a special character). We can also write our own functions, which take care of sets of operations that we tend to repeat again and again.
Later, take a look at this video by one of the key R developers, and check this tutorial.
(From Wikipedia) In geometry, a hypotenuse is the longest side of a right-angled triangle, the side opposite the right angle. The length of the hypotenuse of a right triangle can be found using the Pythagorean theorem, which states that the square of the length of the hypotenuse equals the sum of the squares of the lengths of the other two sides (catheti). For example, if one of the other sides has a length of 3 (when squared, 9) and the other has a length of 4 (when squared, 16), then their squares add up to 25. The length of the hypotenuse is the square root of 25, that is, 5.
Let’s write a function that takes the lengths of the catheti as arguments and returns the length of the hypotenuse:
hypotenuse <- function(cathetus1, cathetus2) {
  hypotenuse <- sqrt(cathetus1*cathetus1 + cathetus2*cathetus2)
  print(paste0("In the triangle with catheti of length ",
               cathetus1, " and ", cathetus2, ", the length of the hypotenuse is ", hypotenuse))
  return(hypotenuse)
}

hypotenuse(390, 456)
## [1] "In the triangle with catheti of length 390 and 456, the length of the hypotenuse is 600.029999250037"
## [1] 600.03
Let’s say we want to clean up a text so that it is easier to analyze: 1) convert everything to lower case; 2) remove all non-alphanumeric characters; and 3) make sure that there are no multiple spaces:
clean_up_text = function(x) {
  x %>%
    str_to_lower %>%                          # make text lower case
    str_replace_all("[^[:alnum:]]", " ") %>%  # remove non-alphanumeric symbols
    str_replace_all("\\s+", " ")              # collapse multiple spaces
}

text = "This is a sentence with punctuation, which mentions Vienna, the capital of Austria."
clean_up_text(text)
## [1] "this is a sentence with punctuation which mentions vienna the capital of austria "
We can think of text analysis as a means of extracting meaningful information from structured and unstructured texts. As historians, we often do that by reading texts and collecting relevant information: taking notes, writing index cards, summarizing texts, juxtaposing one text against another, comparing texts, looking into how specific words and terms are used, etc. Doing text analysis computationally, we do lots of similar things: we extract information of a specific kind, we compare texts, we look for similarities, we look for differences, etc.
While there are similarities between traditional and computational text analysis, there are, of course, also significant differences. One of them is procedural: in computational reading we must explicitly perform every step of our analysis. For example, when we read a sentence, we more or less automatically identify the meaningful words — subject, verb, object, etc.; we identify keywords; we parse every word, identifying what part of speech it is and what its lemma (i.e., its dictionary form) is. By doing these steps we re-construct the meaning of the text that we read — but we perform most of them almost unconsciously, especially if a text is written in our native tongue. In computational analysis, these steps must be performed explicitly (in the order of growing complexity):
tokenization, i.e. splitting a text into minimal meaningful units (most commonly, words);
stemming, which usually means the removal of the most common suffixes and endings to get to the stem (or, root) of the word;
lemmatization, i.e. reducing each word form to its lemma, or dictionary form.

NOTE: NLP — natural language processing.
Some examples:
The library textstem does lemmatization and stemming, but only for English. Tokenization can be performed with the str_split() function — you can define how you want your string to be split.
## [[1]]
## [1] "He" "tried" "to" "open" "one" "of" "the" "bigger"
## [9] "boxes" ""
##
## [[2]]
## [1] "The" "smaller" "boxes" "did" "not" "want" "to"
## [8] "be" "opened" ""
##
## [[3]]
## [1] "Different" "forms" "open" "opens" "opened" "opening"
## [7] "opened" "opener" "openers" ""
## [1] "He try to open one of the big box."
## [2] "The small box do not want to be open."
## [3] "Different form: open, open, open, open, open, opener, opener."
## [1] "He tri to open on of the bigger box."
## [2] "The smaller box did not want to be open."
## [3] "Differ form: open, open, open, open, open, open, open."
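The code chunk that produced the three outputs above is not shown; the following is a minimal sketch of how they can be reproduced with str_split() and textstem (the example sentences are reconstructed from the outputs and are thus assumptions):

```r
library(stringr)
library(textstem)

sentences <- c("He tried to open one of the bigger boxes.",
               "The smaller boxes did not want to be opened.",
               "Different forms: open, opens, opened, opening, opened, opener, openers.")

str_split(sentences, "\\W+")   # tokenization: split on non-word characters
lemmatize_strings(sentences)   # lemmatization: reduce words to dictionary forms
stem_strings(sentences)        # stemming: cut words down to their stems
```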
Note: It is often important to ensure that all capital letters are converted into small letters or the other way around; additionally, some normalization procedures may be necessary to reduce orthographic complexities of specific languages (for example, ö > oe).
Let’s load all issues of the Dispatch from 1862.
library(tidytext)
d1862 <- read.delim(paste0(pathToFiles, "dispatch_1862.tsv"), encoding="UTF-8", header=TRUE, quote="")
head(d1862)

We can quickly check what types of articles there are in those issues.
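One quick way is to count the values of the type column (a sketch; it assumes the loaded table has a type column):

```r
# frequency of each article type, base R:
table(d1862$type)

# or the tidyverse way, sorted by frequency:
d1862 %>%
  count(type, sort = TRUE)
```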
We can create subsets of articles based on their types.
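For example, a subset of death notices could be created with filter() (the value "death" is an assumption; use the type values you actually found in the previous step):

```r
# keep only the rows whose type is "death" (hypothetical value)
death_d1862 <- d1862 %>%
  filter(type == "death")
```

Repeat the same pattern with other values of type to create the remaining subsets.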
Create subsets for other major types.
Describe problems with the data set and how they can be fixed.
your answer goes here…
Now, let’s tidy them up: to work with this as a tidy dataset, we need to restructure it in the one-token-per-row format, which, as we saw earlier, is done with the unnest_tokens() function.
test_set <- death_d1862
test_set_tidy <- test_set %>%
mutate(item_number = cumsum(str_detect(text, regex("^", ignore_case = TRUE)))) %>%
select(-type) %>%
unnest_tokens(word, text) %>%
mutate(word_number = row_number())
test_set_tidy

Stop words are an important concept. In general, this notion refers to the most frequent words/tokens which one might want to exclude from analysis. There are existing lists of stop words that you can find online, and they can work fine for testing purposes.
data("stop_words")
test_set_tidy_clean <- test_set_tidy %>%
anti_join(stop_words, by="word")
test_set_tidy_clean

For research purposes, it is highly advisable to develop your own stop word lists. The process is very simple:
1 to exclude; 0 to keep. It is convenient to automatically fill the column with a default value (0) and then change only the words that you want to remove (1). You will see that some words, despite their frequency, might be worth keeping. When you are done, you can load the list and use the anti_join() function to filter your corpus.
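One way to implement this workflow is sketched below (the file name my_stopwords.tsv and the column name exclude are placeholders):

```r
# 1) save a frequency list with a default `exclude` column set to 0
test_set_tidy %>%
  count(word, sort = TRUE) %>%
  mutate(exclude = 0) %>%
  write_tsv("my_stopwords.tsv")

# 2) edit the file by hand, setting `exclude` to 1 for words to remove

# 3) load the edited list and filter the corpus with anti_join()
my_stopwords <- read_tsv("my_stopwords.tsv") %>%
  filter(exclude == 1) %>%
  select(word)

test_set_tidy %>%
  anti_join(my_stopwords, by = "word")
```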
Wordclouds can be an efficient way to visualize most frequent words. Unfortunately, in most cases, wordclouds are not used either correctly or efficiently. (Let’s check Google for some examples).
library(wordcloud)
library("RColorBrewer")
test_set_tidy_clean <- test_set_tidy %>%
anti_join(stop_words, by="word") %>%
count(word, sort=T)
set.seed(1234)
wordcloud(words=test_set_tidy_clean$word, freq=test_set_tidy_clean$n,
min.freq = 1, rot.per = .25, random.order=FALSE, #scale=c(5,.5),
          max.words=150, colors=brewer.pal(8, "Dark2"))

your answer goes here…
For more details on generating word clouds in R, see: http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know.
This kind of plot works better with texts rather than with newspapers. Let’s take a look at a script of Episode I:
SW_to_DF <- function(path_to_file, episode){
  sw_sentences <- scan(path_to_file, what="character", sep="\n")
  sw_sentences <- as.character(sw_sentences)
  sw_sentences <- gsub("([A-Z]) ([A-Z])", "\\1_\\2", sw_sentences)
  sw_sentences <- gsub("([A-Z])-([A-Z])", "\\1_\\2", sw_sentences)
  sw_sentences <- as.data.frame(cbind(episode, sw_sentences), stringsAsFactors=FALSE)
  colnames(sw_sentences) <- c("episode", "sentences")
  return(sw_sentences)
}
sw1_df <- SW_to_DF(paste0(pathToFiles, "sw1.md"), "sw1")
sw1_df_tidy <- sw1_df %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(sentences, regex("^#", ignore_case = TRUE))))
sw1_df_tidy <- sw1_df_tidy %>%
  unnest_tokens(word, sentences)

Try names of different characters (“shmi”, “padme”, “anakin”, “sebulba”), or other terms that you know are tied to a specific part of the movie (pod, naboo, gungans, coruscant).
ourWord = "sebulba"
word_occurrence_vector <- which(sw1_df_tidy$word == ourWord)
#plot(x=word_occurrence_vector, type="h")
plot(0, type='n', #ann=FALSE,
     xlim=c(1,length(sw1_df_tidy$word)), ylim=c(0,1),
     main=paste0("Dispersion Plot of `", ourWord, "` in SW1"),
     xlab="Movie Time", ylab=ourWord, yaxt="n")
segments(x0=word_occurrence_vector, x1=word_occurrence_vector, y0=0, y1=2)

For newspapers—and other diachronic corpora—a different approach will work better:
d1862 <- read.delim(paste0(pathToFiles, "dispatch_1862.tsv"), encoding="UTF-8", header=TRUE, quote="", stringsAsFactors = FALSE)
test_set <- d1862
test_set$date <- as.Date(test_set$date, format="%Y-%m-%d")
test_set_tidy <- test_set %>%
mutate(item_number = cumsum(str_detect(text, regex("^", ignore_case = TRUE)))) %>%
select(-type) %>%
unnest_tokens(word, text) %>%
mutate(word_number = row_number())
test_set_tidy

test_set_tidy_freqDay <- test_set_tidy %>%
anti_join(stop_words, by="word") %>%
group_by(date) %>%
count(word)
test_set_tidy_freqDay

We can now build a graph of word occurrences over time. In the example below we search for manassas — the place where the Second Battle of Bull Run (or, the Second Battle of Manassas) took place on August 28-30, 1862. The battle ended in a Confederate victory. Our graph shows a spike of mentions of Manassas in the first days of September — right after the battle took place.
Such graphs can be used to monitor discussions of different topics in chronological perspective.
# interesting examples:
# deserters, killed,
# donelson (The Battle of Fort Donelson took place in early February of 1862),
# manassas (place of the Second Bull Run, fought in August 28–30, 1862),
# shiloh (Battle of Shiloh took place in April of 1862)
ourWord = "manassas"
test_set_tidy_word <- test_set_tidy_freqDay %>%
filter(word==ourWord)
plot(x=test_set_tidy_word$date, y=test_set_tidy_word$n, type="l", lty=3, lwd=1,
main=paste0("Word `", ourWord, "` over time"),
xlab = "1862 - Dispatch coverage", ylab = "word frequency per day")
segments(x0=test_set_tidy_word$date, x1=test_set_tidy_word$date, y0=0, y1=test_set_tidy_word$n, lty=1, lwd=2)

your response goes here…
Keywords-in-context is the most common method for creating concordances — a view that allows us to go through all instances of specific words or word forms in order to understand how they are used. The quanteda library offers a very quick and easy application of this method:
library(quanteda)
library(readtext)
dispatch1862 <- readtext(paste0(pathToFiles, "dispatch_1862.tsv"), text_field = "text", quote="")
dispatch1862corpus <- corpus(dispatch1862)

Now, we can query the created corpus object using this command: kwic(YourCorpusObject, pattern = YourSearchPattern). pattern= can also take vectors (for example, c("soldier*", "troop*")); you can also search for phrases with pattern=phrase("fort donelson"); window= defines how many words will be shown before and after the match.
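For example (the search terms are only illustrations; note that recent versions of quanteda expect a tokens object, so the corpus is tokenized first with tokens()):

```r
# all matches of a single pattern, with 5 words of context on each side
kwic_test <- kwic(tokens(dispatch1862corpus), pattern = "donelson", window = 5)

# a multi-word phrase works too
kwic_fort <- kwic(tokens(dispatch1862corpus), pattern = phrase("fort donelson"), window = 5)

head(kwic_test)
```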
If you type View(kwic_test) in your console, an HTML table with all the results will be opened in your browser.